Abstract:
A programmable calculation unit is proposed on the basis of piece-wise linear approximation. An arbitrary function of a single variable is approximately retrieved by piece-wise linear segmentation, with the parameters of each segment pre-stored in a look-up table. An improved scheme of time-encoded stochastic computing elements is developed to efficiently carry out the multiplication and summation for a specific segment. The piece-wise linear parameters are stored in a set of compact multi-valued logic flip-flops. The entire calculation processor is designed with 588 MOS transistors in a 0.18 µm CMOS technology. As a proof of concept, three commonly applied non-linear functions, namely tanh, exponential, and the rectified linear unit, are demonstrated as examples. Circuit simulation results show that the proposed processor retrieves all example functions with an average error below 2.1% while consuming at most 2.78 pJ. The calculation speed reaches roughly 36× that of state-of-the-art works.
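The look-up-table scheme described above can be illustrated with a minimal software model: slope and intercept per segment are pre-computed, and evaluation is one table fetch plus one `a*x + b`. This is a hedged Python sketch of the idea only, not the paper's 588-transistor stochastic-computing circuit; the tanh target, the [-4, 4] range, and the 16-segment count are arbitrary choices for illustration.

```python
import math

def build_pwl_table(f, lo, hi, n_segments):
    """Pre-compute (slope, intercept) for each uniform segment of f on
    [lo, hi], playing the role of the look-up table."""
    step = (hi - lo) / n_segments
    table = []
    for i in range(n_segments):
        x0, x1 = lo + i * step, lo + (i + 1) * step
        slope = (f(x1) - f(x0)) / (x1 - x0)
        table.append((slope, f(x0) - slope * x0))
    return table

def pwl_eval(table, lo, hi, x):
    """Select the segment containing x and evaluate a*x + b from the
    stored parameters."""
    i = min(int((x - lo) / (hi - lo) * len(table)), len(table) - 1)
    a, b = table[i]
    return a * x + b

table = build_pwl_table(math.tanh, -4.0, 4.0, 16)
worst = max(abs(pwl_eval(table, -4.0, 4.0, k / 100) - math.tanh(k / 100))
            for k in range(-400, 401))
```

Even this crude 16-segment model keeps the worst-case tanh error within a few percent, which is the same order as the reported 2.1% average error.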
Abstract:
Low-power (LP), high-speed (HS) real-time computing has recently become necessary in various application areas such as image processing, neural networks, the Internet of Things, and digital signal processing (DSP). In a real-time 3-D graphics system, 86% of the data-processing time is due to division (DIV) and multiplication (MUL) operations. The approximate multiplier (AMP) is a possible key to hardware-efficient, fast multiplication, and over the last 10 years it has become a main arithmetic component in many applications. However, a systematic account of this development, along with its merits and demerits, is not available in the literature. Hence, this paper presents not only the design and evolution of AMP architectures but also open research areas in AMP. This systematic study covers the new architectures researchers have used to improve AMP designs and highlights their advantages over other methods.
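As a concrete illustration of the basic idea behind approximate multipliers, here is one of the simplest hardware-motivated schemes, operand truncation: the low bits of each operand are dropped before multiplying, which shrinks the partial-product array at the cost of a bounded error. This is a generic sketch of the concept, not any specific architecture from the survey.

```python
def approx_multiply(a, b, trunc_bits=4):
    """Truncation-based approximate multiplication: drop the trunc_bits
    least significant bits of each unsigned operand, multiply the shorter
    operands, then shift the product back into place."""
    return ((a >> trunc_bits) * (b >> trunc_bits)) << (2 * trunc_bits)

exact = 1000 * 2000                      # 2_000_000
approx = approx_multiply(1000, 2000)     # 1_984_000, under 1% relative error
rel_error = abs(exact - approx) / exact
```

In hardware terms, halving the operand widths roughly quarters the partial-product count, which is where the area, power, and delay savings come from.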
Abstract:
Recently, deep neural network (DNN) based approaches have emerged as indispensable tools in many fields, ranging from image and video recognition to natural language processing. However, the large size of such newly developed networks poses both throughput and energy challenges to the underlying processing hardware, which could be a major stumbling block for promising applications such as self-driving cars and smart cities. Existing work proposes to weed out zeros from input neurons to avoid unnecessary DNN computation (zero-valued operand multiplications). However, we observe that many output neurons remain ineffectual even after the zero-removal technique has been applied. These ineffectual output neurons cannot pass their values to the subsequent layer, which means all the computations (both zero-valued and non-zero-valued operand multiplications) related to them are futile and wasteful. There is therefore an opportunity to significantly improve the performance and efficiency of DNN execution by predicting the ineffectual output neurons and skipping over them entirely. To do so, we propose a two-stage, prediction-based DNN execution model without accuracy loss. We also propose a uniform serial processing element (USPE) for both the prediction and execution stages to improve flexibility and minimize area overhead. To improve processing throughput, we further present a scale-out design for the USPE. Evaluation results over a set of state-of-the-art DNNs show that our proposed design achieves a 2.5× speedup and 1.9× better energy efficiency on average over a traditional accelerator. Moreover, when stacked with our design, Cnvlutin and Stripes improve by 1.9× and 2.0× on average, respectively.
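The two-stage idea can be sketched in plain Python: a cheap low-precision pass predicts which ReLU outputs will be non-zero, and only those are computed at full precision. This is a hypothetical toy model; the quantized predictor and function names here are invented for illustration, and unlike the paper's lossless scheme, a crude predictor like this one can misclassify outputs.

```python
def quantize(v, bits=4):
    """Coarse quantization emulating a reduced-precision prediction
    datapath (hypothetical; the paper's predictor is a hardware USPE)."""
    return round(v * (1 << bits)) / (1 << bits)

def predict_effectual(weights, x):
    """Stage 1: cheap low-precision dot products flag the output neurons
    that are likely to survive ReLU."""
    return [sum(quantize(w) * quantize(xi) for w, xi in zip(row, x)) > 0
            for row in weights]

def two_stage_layer(weights, x):
    """Stage 2: full-precision multiply-accumulate only for the
    predicted-effectual outputs; the rest are skipped and emitted as 0."""
    mask = predict_effectual(weights, x)
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) if keep else 0.0
            for row, keep in zip(weights, mask)]

# second output neuron is predicted ineffectual, so its MACs are skipped
out = two_stage_layer([[1.0, 1.0], [-1.0, -1.0]], [0.5, 0.5])
```

The savings come from the skipped rows: every multiply-accumulate feeding a predicted-ineffectual output is avoided entirely, not just its zero-valued operands.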
Abstract:
Multiplication is an integral part of most signal processing applications. Most multipliers occupy a large area, consume considerable power, and are a main source of delay in computer arithmetic. Approximate multiplication is extensively used in applications such as multimedia and image processing, which require higher speed and can tolerate some error and imprecision. We study an exact compressor and two approximate compressors for the Dadda multiplier, and compare the hardware complexity, critical-path delay, error rate, and power consumption of the three designs.
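To make the compressor comparison concrete, here is a behavioral model of a hypothetical simplified 4:2-style compressor checked against an exact ones-count over all 16 input patterns. This cell is invented for illustration and is not necessarily either of the two approximate designs the study compares; it shows the typical trade: fewer gates, but some input patterns are under-counted.

```python
import itertools

def exact_count(x1, x2, x3, x4):
    """Exact value a 4:2 compressor stage must encode: the ones count."""
    return x1 + x2 + x3 + x4

def approx_compress(x1, x2, x3, x4):
    """Hypothetical simplified compressor: cheaper logic, but it
    under-counts cross-pair two-one patterns and the all-ones pattern."""
    carry = (x1 & x2) | (x3 & x4)
    s = (x1 ^ x2) | (x3 ^ x4)
    return 2 * carry + s

errors = sum(approx_compress(*bits) != exact_count(*bits)
             for bits in itertools.product((0, 1), repeat=4))
error_rate = errors / 16  # 5 of 16 patterns are wrong for this sketch
```

Exhaustive enumeration like this is exactly how the error rate of a small approximate cell is tabulated before it is placed into a Dadda reduction tree.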
Abstract:
Approximate computing is an emerging design technique for error-resilient applications. It improves circuit area, power, and delay at the cost of introducing some errors. Approximate logic synthesis (ALS) is an automatic process for producing approximate circuits. This paper proposes approximate resubstitution with an approximate care set and uses it to build a simulation-based ALS flow. Experimental results demonstrate that the proposed method saves 7%-18% area compared to state-of-the-art methods. The code of ALSRAC has been made open-source.
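The acceptance check at the heart of such an ALS flow can be sketched as follows: a candidate resubstitution is compared against the exact circuit and kept only if the resulting error metric is acceptable. This toy majority-gate example enumerates inputs exhaustively; ALSRAC's actual flow works on netlists and uses simulation with approximate care sets rather than this brute-force comparison.

```python
import itertools

def error_rate(exact_fn, approx_fn, n_inputs):
    """Compare a candidate approximate circuit with the exact one over all
    input patterns (feasible for tiny circuits only)."""
    patterns = list(itertools.product((0, 1), repeat=n_inputs))
    wrong = sum(exact_fn(x) != approx_fn(x) for x in patterns)
    return wrong / len(patterns)

# exact 3-input majority vs. a candidate that drops one AND term
majority = lambda x: (x[0] & x[1]) | (x[1] & x[2]) | (x[0] & x[2])
candidate = lambda x: (x[0] & x[1]) | (x[1] & x[2])
rate = error_rate(majority, candidate, 3)  # differs only on input (1, 0, 1)
```

If the measured rate is within the user's error budget, the smaller candidate replaces the exact sub-circuit and the saved AND gate becomes the area gain.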
Abstract:
This paper studies a physically based adaptive polygonization method for parametric surfaces. The algorithm distributes a set of sample points according to the curvature field of the surface; a triangulation is then superimposed on this sample set. The locations of the sample points are obtained using physically based models of particle interaction. Two physical models are considered in this study: a spring-mass model and a model of electrostatically charged particles. Once the physical model reaches its equilibrium state, the resulting algorithm distributes the approximation error of the triangulation equally throughout the surface.
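The equal-error distribution that the particle system converges to can be sketched directly (and non-physically) by placing sample parameters at equal quantiles of integrated curvature, so high-curvature regions receive more samples. This 1-D Python sketch on the curve y = x² is an illustration of the target distribution only, not the paper's spring-mass or charged-particle relaxation on 2-D surfaces.

```python
def curvature_adaptive_samples(curvature, lo, hi, n, grid=1000):
    """Place n+1 sample parameters so each of the n segments carries an
    equal share of integrated curvature, i.e. roughly equal chord error."""
    h = (hi - lo) / grid
    cum = [0.0]
    for i in range(grid):  # midpoint rule for the cumulative curvature
        t = lo + (i + 0.5) * h
        cum.append(cum[-1] + curvature(t) * h)
    samples, j = [], 0
    for k in range(n + 1):  # invert the cumulative curve at equal quantiles
        target = cum[-1] * k / n
        while j < grid and cum[j + 1] < target:
            j += 1
        samples.append(lo + j * h)
    return samples

# curvature of y = x^2 is 2 / (1 + 4 x^2)^(3/2), concentrated near x = 0
pts = curvature_adaptive_samples(lambda x: 2 / (1 + 4 * x * x) ** 1.5,
                                 -2.0, 2.0, 8)
```

The samples cluster around x = 0, where the parabola bends most, and spread out towards the nearly straight ends, exactly the spacing the physical models settle into at equilibrium.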
Abstract:
We present a survey of approximate techniques and discuss concepts for building power- and energy-efficient computing components, ranging from approximate accelerators to arithmetic blocks (such as adders and multipliers). We provide a systematic understanding of how to generate and explore the design space of approximate components, which enables a wide range of power/energy, performance, area, and output-quality trade-offs, and a high degree of design flexibility to facilitate their design. To enable cross-layer approximate computing, bridging the gap between the logic layer (i.e., arithmetic blocks) and the architecture layer (and even the software layers) is crucial. Towards this end, this paper introduces open-source libraries of low-power and high-performance approximate components. The elementary approximate arithmetic blocks (adder and multiplier) are used to develop multi-bit approximate arithmetic blocks and accelerators. An analysis of data-driven resilience and error propagation is discussed. These approximate computing components are a first step towards a systematic approach for introducing approximate computing paradigms at all levels of abstraction.
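As a concrete example of the kind of elementary approximate arithmetic block such libraries contain, here is a behavioral model of a lower-part-OR adder (LOA), a well-known approximate adder in which the low k bits skip the carry chain entirely. This is a generic sketch of the LOA technique, not necessarily a component of the specific libraries the paper introduces.

```python
def loa_add(a, b, k=4, width=16):
    """Lower-part-OR adder (LOA): the k low bits are combined with a plain
    bitwise OR (no carry chain); only the upper bits use exact addition,
    and the carry from the low part into the high part is dropped."""
    low = (a | b) & ((1 << k) - 1)
    high = ((a >> k) + (b >> k)) << k
    return (high | low) & ((1 << width) - 1)

exact = 1234 + 5678
# error equals (a & b) in the low k bits, so it is at most 2**k - 1
approx = loa_add(1234, 5678)
```

Removing the carry chain from the k low bits shortens the critical path and saves the full adders there, which is precisely the logic-layer power/delay/quality trade-off the survey's design-space exploration quantifies.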